RGB565 saturated addition
In the previous two issues of Hugi, people have gradually sped up the operation of 16bit saturated addition. The latest version added two pixels at a time in a parallel fashion, where six colour components were combined using two ADDs; however, the components were clamped one at a time using SETcc instructions. Parallelization can also be achieved at this stage through pure arithmetic operations, as will be shown step-by-step in the text below.
As the readers of Hugi are already fluent with multiple operating systems, I am sure no one will object against me writing all example code in 680x0 assembly. :)
To illustrate the approach, let us take the example of performing saturated addition on the 4-bit value contained in bits 2-5 of the two source parameters. We will assume that the surrounding bits in the register can be destroyed, because that will be the situation when we extend the operation to full RGB565 pixels.
We mask off the irrelevant bits:
; d0.b value0
; d1.b value1
and.b #$3c,d0 ; Only keep middle 4 bits
and.b #$3c,d1 ; Only keep middle 4 bits
After this, both registers are in the form %00xxxx00, that is, the two top and the two bottom bits are zero.
We add them together:
add.b d1,d0
The produced result will be on the form %0oxxxx00, with the o-bit indicating overflow. If the overflow bit is set, we want to set all x-bits to 1. Our task is thus to generate a bitmask which is either in %..0000.. or %..1111.. form, depending on whether the overflow bit is clear or set.
We isolate a copy of the overflow bit:
move.b d0,d1 ; Copy value
and.b #$40,d1 ; Isolate overflow bit
Our temporary value is either %00000000 or %01000000, depending on whether the overflow bit is clear or set. Shifting down the overflow bit to the bottommost position of the register, and then decrementing the register will give us a bitmask of %11111111 or %00000000 in either case, so we follow it with a bitwise not to get %00000000 or %11111111.
We create the saturation mask:
lsr.b #6,d1 ; Move overflow bit down to bitpos 0
subq.b #1,d1 ; Decrement, to make bitmask
not.b d1 ; Flip bitmask
By now we have a bitmask of %..0000.. or %..1111.., so we OR this onto the result of the addition, which performs the actual saturation.
We saturate the value:
or.b d1,d0
Some minor adjustments can be done to the above algorithm, to make it more useful for the RGB565 addition. One thing, is not to shift down the overflow bit to bit position 0, but to the lowest bit position of the mask, and flipping the overflow bit before the decrement instead of flipping the whole bitmask after said operation:
lsr.b #4,d1 ; Move overflow bit down to bitpos 2
eor.b #%00000100,d1 ; Flip overflow bit
subq.b #%00000100,d1 ; Decrement bit 2
Now the bits below the mask remain unaffected by the mask computation, except for the shift. How about the bits further above -- if we were doing this operation on, say, a longword?
It turns out that if we set a bit somewhere further up before the decrement is performed, the bit will act as a 'stop bit' and mark where the bitmask should end. If we start out with either %00000000 or %01000000, shift down the overflow bit to bit position 2 we get %00000000 or %00000100, set the stop bit and flip the overflow bit we get %01000100 or %01000000, and subtracting %100 finally yields a mask of %0100000000 or %00111100:
lsr.b #4,d1 ; Move overflow bit down to bitpos 2
eor.b #%01000100,d1 ; Flip overflow bit + set stop bit
subq.b #%00000100,d1 ; Decrement bit 2
Since the mask calculation now only affects a limited portion of the register, multiple masks can be calculated in parallel! We are now ready to tackle the task of performing RGB565 saturated addition.
Saturated addition of two longwords containing RGB565 pixels can be split up into a series of steps:
1. Separate the components of the pixels
Separation of the components provides room for possible overflow of each component; it enables us to read the overflow bit of each component, and avoids overflow in one component to affect the value of the next higher component.
; d0 colour0
; d1 colour1
move.l d0,d2
move.l d1,d3
and.l #$f81f07e0,d0 ; Keep R-B-G-
and.l #$07e0f81f,d1 ; Keep -G-R-B
eor.l d0,d2 ; Keep -G-R-B
eor.l d1,d3 ; Keep R-B-G-
2. Add the separate components
Since there now is some space between each component in each register, the additions can be performed safely. The overflow state of each component will be indicated by the bit above each component (the bit is essentially a carry-flag). Beware that the overflow-bit of the R component in d0/d3 will land in the X/Carry-flag, so it will need to be shifted in immediately during the next step.
add.l d2,d1 ; Add -G-R-B
add.l d3,d0 ; Add R-B-G-
3. Isolate overflow-bits
The results of the additions are copied over into other registers, and the overflow-bit of the components are isolated. Notice that one of the overflow-bits must be shifted in before any other arithmetic operations are performed.
move.l d1,d2 ; Copy -G-R-B
move.l d0,d3 ; Copy R-B-G-
roxr.l #6,d3 ; Shift in R-overflow-bit
and.l #$08010020,d2 ; Only keep overflow-bits
and.l #$04008020,d3 ; Only keep overflow-bits
4. Extend overflow-bits into full bitmasks
This is the new part. If a register contains a single bit somewhere, which is either 0 or 1, that bit can be extended into a full bitfield through a three-step operation: First a stop-bit is set somewhere further up in the register. This bit marks where the bitfield should end. Then the data bit is inverted, to give it the right sign. Finally (1 << BITPOS) is subtracted from the register. Notice that the final shift is an arithmetic shift -- this extends the uppermost bitmask while shifting.
eor.l #$08410820,d2 ; Flip overflow-bits & set stopbits
eor.l #$04208820,d3 ; Flip overflow-bits & set stopbits
sub.l #$08010020,d2 ; Make bitmasks
sub.l #$04008020,d3 ; Make bitmasks
asr.l #6,d2 ; Line masks up with the components
5. OR the bitmasks onto the result of the additions
or.l d3,d0
or.l d2,d1
6) Combine the components back into a longword
and.l #$f81f07e0,d0 ; Remove garbage
and.l #$07e0f81f,d1 ; Remove garbage
or.l d1,d0 ; Combine components
... and finally the saturation is done.
The above approach is easily modified to perform saturated subtraction -- generate bitmasks, which are applied using ANDs, to zero out any components that have underflowed.
A variation on the theme can also be used to do saturated addition/subtraction on ARGB8888 pixels: do not separate components, instead zero out the lowest bit of each byte. Each such bit will be the overflow-bit of the byte below. The entire saturation process takes approximately 7-8 instructions if done well.
Finally, people who implement this on Intel machines, please remember that the RCR instruction is very slow on the P5 architecture! Double normal shifts will run faster.